TOPO-X: Co-optimize Flow Scheduling Topology and ML Training Parallelism


The rapid advancement of large-scale deep neural networks and large language models has intensified the demand for highly efficient GPU clusters. However, existing approaches, such as Fat-tree networks and TopoOpt, struggle with inefficient resource utilization and network bottlenecks: they optimize communication, parallelism, and network topology independently, failing to leverage their interdependencies. To address this gap, we propose TOPO-X, a novel reconfigurable network framework that co-optimizes flow scheduling, training parallelism, and optical network topology. By formulating this integrated optimization challenge as a Resource-Constrained Project Scheduling Problem, TOPO-X dynamically adapts to changing workloads and network conditions using optical network reconfiguration capabilities. Our experimental results show that TOPO-X outperforms the state-of-the-art solution, TopoOpt, achieving an average 2.22x speedup in training iteration time. These findings highlight TOPO-X as a promising approach for scalable, adaptive, and high-performance GPU clusters designed to meet the increasing demands of large-scale AI training workloads.
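To make the Resource-Constrained Project Scheduling Problem (RCPSP) formulation mentioned in the abstract concrete, here is a minimal, generic sketch of a serial schedule-generation scheme for RCPSP. The task names, durations, demands, and the single shared resource are illustrative stand-ins (e.g., communication steps competing for link bandwidth); none of this code comes from the paper itself, and TOPO-X's actual solver is not described here.

```python
def serial_sgs(tasks, capacity):
    """Greedy serial schedule-generation scheme for a toy RCPSP.

    tasks: dict mapping name -> (duration, resource_demand, [predecessors])
    capacity: capacity of the single renewable resource.
    Returns a dict mapping each task name to its start time.
    """
    start = {}
    pending = dict(tasks)
    while pending:
        for name, (dur, dem, preds) in list(pending.items()):
            # Only schedule a task once all its predecessors are placed.
            if any(p in pending for p in preds):
                continue
            # Earliest start respecting precedence constraints.
            t = max((start[p] + tasks[p][0] for p in preds), default=0)
            # Delay the start until resource capacity holds over [t, t+dur).
            while True:
                usage = max(
                    (sum(d for n, (du, d, _) in tasks.items()
                         if n in start and start[n] <= u < start[n] + du)
                     for u in range(t, t + dur)),
                    default=0,
                )
                if usage + dem <= capacity:
                    break
                t += 1
            start[name] = t
            del pending[name]
            break
    return start
```

For example, two unit-demand tasks fit side by side under capacity 2, but under capacity 1 the second must wait until the first finishes; precedence constraints additionally delay dependent tasks.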

Yi-Xiang Hu, Han Tian, Yifang Zhao, Feng Wu, Xiang-Yang Li. TOPO-X: Co-optimize Flow Scheduling Topology and ML Training Parallelism. In Proceedings of the 34th International Conference on Computer Communications and Networks (ICCCN), Tokyo, Japan, August 2025.
@inproceedings{HTZicccn25,
 address = {Tokyo, Japan},
 author = {Yi-Xiang Hu and Han Tian and Yifang Zhao and Feng Wu and Xiang-Yang Li},
 booktitle = {Proceedings of the 34th International Conference on Computer Communications and Networks (ICCCN)},
 doi = {10.1109/ICCCN65249.2025.11133974},
 month = {August},
 title = {TOPO-X: Co-optimize Flow Scheduling Topology and ML Training Parallelism},
 year = {2025}
}